{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![LOGO](../img/MODIN_ver2_hrz.png)\n",
    "\n",
    "<center><h2>Scale your pandas workflows by changing one line of code</h2>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3: Not Implemented\n",
    "\n",
    "**GOAL**: Learn what happens when a function is not yet supported in Modin and what functionality is not possible to accelerate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When functionality has not yet been implemented, we default to pandas\n",
    "\n",
    "![](../img/convert_to_pandas.png)\n",
    "\n",
    "We convert a Modin dataframe to pandas to do the operation, then convert it back once it is finished. These operations will have a high overhead due to the communication involved and will take longer than pandas.\n",
    "\n",
    "When this is happening, a warning will be given to the user to inform them that this operation will take longer than usual. For example, `DataFrame.kurtosis` is not yet implemented. In this case, when a user tries to use it, they will see this warning:\n",
    "\n",
    "```\n",
    "UserWarning: `DataFrame.kurtosis` defaulting to pandas implementation.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concept for exercise: Default to pandas\n",
    "\n",
    "In this section of the exercise we will see first-hand how the runtime is affected by operations that are not implemented."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import modin.pandas as pd\n",
    "import pandas\n",
    "import numpy as np\n",
    "import time\n",
    "\n",
    "frame_data = np.random.randint(0, 100, size=(2**18, 2**8))\n",
    "df = pd.DataFrame(frame_data).add_prefix(\"col\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pandas_df = pandas.DataFrame(frame_data).add_prefix(\"col\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "modin_start = time.time()\n",
    "\n",
    "print(df.kurtosis())\n",
    "\n",
    "modin_end = time.time()\n",
    "print(\"Modin kurtosis took {} seconds.\".format(round(modin_end - modin_start, 4)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pandas_start = time.time()\n",
    "\n",
    "print(pandas_df.kurtosis())\n",
    "\n",
    "pandas_end = time.time()\n",
    "print(\"pandas kurtosis took {} seconds.\".format(round(pandas_end - pandas_start, 4)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concept for exercise: Register custom functions\n",
    "\n",
    "Modin's user-facing API is pandas, but it is possible that we do not yet support your favorite or most-needed functionalities. Your user-defined function may also be able to be executed more efficiently if you pre-define the type of function it is (e.g. map, reduction, etc.). To solve either case, it is possible to register a custom function to be applied to your data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Registering a custom function for all query compilers\n",
    "\n",
    "To register a custom function for a query compiler, we first need to import it:\n",
    "\n",
    "```python\n",
    "from modin.backends.pandas.query_compiler import PandasQueryCompiler\n",
    "```\n",
    "\n",
    "The `PandasQueryCompiler` is responsible for defining and compiling the queries that can be operated on by Modin, and is specific to the pandas backend. Any queries defined here must also both be compatible with and result in a `pandas.DataFrame`. Many functionalities are very simply implemented, as you can see in the current code: [Link](https://github.com/modin-project/modin/blob/f15fb8ea776ed039893130b1e85053e875912d4b/modin/backends/pandas/query_compiler.py#L365).\n",
    "\n",
    "If we want to register a new function, we next to understand what kind of function it is. In our example, we will use `kurtosis`, which is a reduction. So we next want to import the function type so we can use it in our definition:\n",
    "\n",
    "```python\n",
    "from modin.data_management.functions import ReductionFunction\n",
    "```\n",
    "\n",
    "Then we can just use the `ReductionFunction.register` `classmethod` and assign it to the `PandasQueryCompiler`:\n",
    "\n",
    "```python\n",
    "PandasQueryCompiler.kurtosis = ReductionFunction.register(pandas.DataFrame.kurtosis)\n",
    "```\n",
    "\n",
    "Finally, we want a handle to it from the `DataFrame`, so we need to create a way to do that:\n",
    "\n",
    "```python\n",
    "def kurtosis_func(self, **kwargs):\n",
    "    # The constructor allows you to pass in a query compiler as a keyword argument\n",
    "    return self.__constructor__(query_compiler=self._query_compiler.kurtosis(**kwargs))\n",
    "\n",
    "pd.DataFrame.kurtosis_custom = kurtosis_func\n",
    "```\n",
    "\n",
    "And then you can use it like you usually would:\n",
    "\n",
    "```python\n",
    "df.kurtosis_custom()\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from modin.backends.pandas.query_compiler import PandasQueryCompiler\n",
    "from modin.data_management.functions import ReductionFunction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PandasQueryCompiler.kurtosis_custom = ReductionFunction.register(pandas.DataFrame.kurtosis)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The function signature came from the pandas documentation:\n",
    "# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.kurtosis.html\n",
    "def kurtosis_func(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):\n",
    "    # We need to specify the axis for the backend\n",
    "    if axis is None:\n",
    "        axis = 0\n",
    "    # The constructor allows you to pass in a query compiler as a keyword argument\n",
    "    # Reduce dimension is used for reductions\n",
    "    # We also pass all keyword arguments here to ensure correctness\n",
    "    return self._reduce_dimension(\n",
    "        self._query_compiler.kurtosis_custom(\n",
    "            axis=axis, skipna=skipna, level=level, numeric_only=numeric_only, **kwargs\n",
    "        )\n",
    "    )\n",
    "\n",
    "pd.DataFrame.kurtosis_custom = kurtosis_func"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "\n",
    "print(df.kurtosis())\n",
    "\n",
    "end = time.time()\n",
    "print(\"Modin kurtosis took {} seconds.\".format(end - start))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "\n",
    "print(df.kurtosis_custom())\n",
    "\n",
    "end = time.time()\n",
    "print(\"Modin kurtosis_custom took {} seconds.\".format(end - start))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Congratulations! You have just implemented `DataFrame.kurtosis`!\n",
    "\n",
    "## Consider opening a pull request: https://github.com/modin-project/modin/pulls\n",
    "\n",
    "For a complete list of what is implemented, see the [documentation](https://modin.readthedocs.io/en/latest/UsingPandasonRay/dataframe_supported.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test your knowledge: Add a custom function for another reduction: `DataFrame.mad`\n",
    "\n",
    "See the pandas documentation for the correct signature: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mad.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "modin_mad_start = time.time()\n",
    "\n",
    "# Implement your function here! Put the result of your custom `mad` in the variable `modin_mad`\n",
    "# Hint: Look at the kurtosis walkthrough above\n",
    "modin_mad = ...\n",
    "\n",
    "modin_mad_end = time.time()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluation code, do not change!\n",
    "pandas_mad_start = time.time()\n",
    "pandas_mad = pandas_df.mad()\n",
    "pandas_mad_end = time.time()\n",
    "\n",
    "assert isinstance(modin_mad, pd.Series), \"This is not a distributed Modin object, try again\"\n",
    "assert pandas_mad_end - pandas_mad_start > modin_mad_end - modin_mad_start, \\\n",
    "    \"Your implementation was too slow, or you used the defaulting to pandas approach. Try again\"\n",
    "assert modin_mad._to_pandas().equals(pandas_mad), \"Your result did not match the result of pandas, tray again\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now that you are able to create custom functions, you know enough to contribute to Modin!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
